The previous two articles — [Big Data Computing: Batch Processing](https://xx/Big Data Computing:Batch Processing) and [Big Data Computing: Real-Time Processing](https://xx/Big Data Computing:Real-Time Processing) — introduced the principles, architectures, frameworks, application scenarios, and limitations of batch and real-time computing. In [Big Data Computing: Batch Processing vs. Real-Time Computing](https://xx/Big Data Computing:Batch Processing vs. Real-Time Computing), we compared the two approaches across multiple dimensions to understand their characteristics, limitations, and use cases.
This article explores how batch and real-time processing — originally two parallel development paths — have gradually converged into stream-batch unification due to evolving business needs and technological progress. We’ll analyze why unification became necessary, its core principles, and its architecture.
Why Stream-Batch Unification?
Limitations of Traditional Architectures
In traditional big data platforms, batch and real-time processing followed parallel tracks:
- Batch processing excels at bulk analysis of massive historical datasets, at the cost of high latency. Common frameworks: Hadoop, Spark, Hive.
- Real-time processing handles continuous data streams with millisecond- to second-level latency, but is ill-suited to analyzing full historical datasets. Common technologies: Flink, Kafka, Paimon.
This separation introduces several problems:
- Duplicate business logic: The same business requirement often needs two separate implementations — one for batch and one for streaming.
- Data inconsistency: Batch and stream pipelines take different data paths, so their results can diverge.
- High cost: Two codebases must be developed, tested, and maintained, often by separate teams.
- Low resource utilization: Separate frameworks occupy independent clusters, each requiring reserved capacity for peak loads.
Business Drivers
As businesses increasingly demand both real-time responsiveness and deep historical analysis, maintaining two separate systems becomes ever more costly. This pressure drove the exploration of unifying batch and stream processing.
What Is Stream-Batch Unification?
Stream-batch unification means using a single computation engine and programming model to support both batch and real-time workloads while ensuring consistent results. Its key characteristics are:
- Unified engine: Internally adapts to batch or streaming modes.
- Unified API: Developers write code once and run it in either mode (see the sketch after this list).
- Unified data path: Supports bounded and unbounded datasets from the same source.
- Unified resources: Shared compute resources improve utilization.
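To make the "write once, run in either mode" idea concrete, here is a minimal sketch using Flink's DataStream API. The word-count logic and element values are illustrative, not from any particular production setup: the same pipeline becomes a batch or a streaming job by flipping a single runtime-mode switch.

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnifiedWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The only mode-specific line: BATCH for bounded sources,
        // STREAMING for unbounded ones. The pipeline below is untouched.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.fromElements("stream", "batch", "stream") // bounded demo source
           .map(word -> Tuple2.of(word, 1))
           .returns(Types.TUPLE(Types.STRING, Types.INT))
           .keyBy(t -> t.f0)
           .sum(1)
           .print();

        env.execute("unified-word-count");
    }
}
```

Pointing the same pipeline at an unbounded source (such as a Kafka topic) and switching to `RuntimeExecutionMode.STREAMING` turns it into a continuously running job without touching the business logic.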
Architecture of Stream-Batch Unification
A typical unified architecture consists of four layers: data sources, compute engine, storage, and applications.
- Data sources: Kafka, Pulsar, or socket streams continuously feed data into the pipeline.
- Compute engine: Unified APIs (e.g., Flink SQL / Table API) express the business logic once; execution plans adapt to bounded (batch) or unbounded (stream) inputs. A sketch follows this list.
- Storage: Results written to HDFS, data lakes, Kafka, or NoSQL stores, depending on use case.
- Applications: BI dashboards, data warehouse queries, monitoring/alerting, recommendation, and risk control.
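As a hedged illustration of how these layers meet in code, the sketch below wires a Kafka source table, one Flink SQL statement of business logic, and a sink through Flink's TableEnvironment. The topic name, server address, and schema are placeholders, and the print connector stands in for a real storage layer such as HDFS, a data lake, or a NoSQL store.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class UnifiedPipelineSketch {
    public static void main(String[] args) {
        // Streaming mode here; EnvironmentSettings.inBatchMode() would reuse
        // the exact same SQL against a bounded source.
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Source layer: order events from Kafka (names are placeholders).
        tEnv.executeSql(
                "CREATE TABLE orders (" +
                "  order_id STRING," +
                "  amount   DOUBLE" +
                ") WITH (" +
                "  'connector' = 'kafka'," +
                "  'topic' = 'orders'," +
                "  'properties.bootstrap.servers' = 'localhost:9092'," +
                "  'scan.startup.mode' = 'earliest-offset'," +
                "  'format' = 'json'" +
                ")");

        // Storage layer: the print connector stands in for HDFS / lake / NoSQL.
        tEnv.executeSql(
                "CREATE TABLE order_totals (" +
                "  order_id STRING," +
                "  total    DOUBLE" +
                ") WITH ('connector' = 'print')");

        // Compute layer: the business logic, written once in Flink SQL.
        tEnv.executeSql(
                "INSERT INTO order_totals " +
                "SELECT order_id, SUM(amount) AS total " +
                "FROM orders GROUP BY order_id");
    }
}
```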
Technology Stack
1. Messaging Layer
- Role: Data ingestion, buffering, and distribution.
- Tech: Kafka, Pulsar.
2. Compute Engine
- Role: Provide unified APIs and execution plans.
- Tech: Flink.
3. Data Lake
- Role: Unified storage formats with stream-batch read and write support, ensuring consistent data visibility and transactional integrity (see the catalog sketch after this list).
- Tech: Hudi, Delta Lake, Paimon.
4. Query Engines
- Role: Low-latency query responses.
- Tech: Redis, ClickHouse, Doris.
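To show how the data-lake layer bridges both modes, here is a hedged sketch that registers an Apache Paimon catalog in Flink SQL. The warehouse path and table schema are placeholders, and it assumes the Paimon Flink connector jar is on the classpath; a streaming job can continuously write the table while a batch job reads a consistent snapshot of the same data.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class LakeTableSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Register a Paimon catalog (warehouse path is a placeholder).
        tEnv.executeSql(
                "CREATE CATALOG lake WITH (" +
                "  'type' = 'paimon'," +
                "  'warehouse' = 'file:///tmp/paimon'" +
                ")");
        tEnv.executeSql("USE CATALOG lake");

        // One table, two access patterns: streaming writers append changes,
        // batch readers see consistent snapshots of the same data.
        tEnv.executeSql(
                "CREATE TABLE IF NOT EXISTS user_events (" +
                "  user_id STRING," +
                "  action  STRING," +
                "  ts      TIMESTAMP(3)," +
                "  PRIMARY KEY (user_id, ts) NOT ENFORCED" +
                ")");
    }
}
```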
Despite being relatively new, the ecosystem for stream-batch unification is already rich, with multiple technical options tailored for different application scenarios. Typical examples include:
1. Real-Time Data Warehouse
E-commerce and social platforms rely heavily on real-time data warehouses for second-level metrics and long-term trend analysis. A common setup includes:
- Kafka for ingesting business event streams.
- Flink for real-time aggregation and computation.
- Results stored in ClickHouse or Doris for sub-second querying.
- Offline jobs ingest the same data into Hudi for deep historical analysis.
- Unified Flink SQL keeps the logic consistent across batch and streaming (sketched below).
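A hedged sketch of the streaming half of this setup, computing a per-minute sales metric: the topic, schema, and field names are illustrative, and the print sink stands in for a ClickHouse or Doris table (whose connector options vary by connector version).

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class RealtimeMetricsSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Business event stream from Kafka, with a watermark for event time.
        tEnv.executeSql(
                "CREATE TABLE pay_events (" +
                "  item_id STRING," +
                "  amount  DOUBLE," +
                "  ts      TIMESTAMP(3)," +
                "  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND" +
                ") WITH (" +
                "  'connector' = 'kafka'," +
                "  'topic' = 'pay_events'," +
                "  'properties.bootstrap.servers' = 'localhost:9092'," +
                "  'format' = 'json'" +
                ")");

        // Print stands in for a ClickHouse/Doris sink table.
        tEnv.executeSql(
                "CREATE TABLE minute_gmv (" +
                "  window_start TIMESTAMP(3)," +
                "  item_id STRING," +
                "  gmv DOUBLE" +
                ") WITH ('connector' = 'print')");

        // Per-minute sales via a tumbling window; the same statement can run
        // in batch mode over a bounded source for backfills.
        tEnv.executeSql(
                "INSERT INTO minute_gmv " +
                "SELECT window_start, item_id, SUM(amount) AS gmv " +
                "FROM TABLE(TUMBLE(TABLE pay_events, DESCRIPTOR(ts), INTERVAL '1' MINUTE)) " +
                "GROUP BY window_start, window_end, item_id");
    }
}
```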
2. Real-Time Financial Risk Control
Financial platforms often need to monitor transactions in real time, intercept suspicious activity, and maintain historical data for retrospective analysis. A typical setup:
- Kafka ingests risk-control event streams.
- Flink performs real-time rule-based monitoring.
- Daily offline jobs train risk models on historical data.
- Both the batch and streaming tasks share the same Flink Table API logic (see the sketch below).
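To illustrate the shared-logic point, a hedged sketch: one rule written once against the Table API, callable from both a streaming monitor and a batch backtest. The table name, fields, and the 10,000 threshold are invented for illustration.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import static org.apache.flink.table.api.Expressions.$;
import static org.apache.flink.table.api.Expressions.lit;

public class RiskRuleSketch {
    // The rule is defined once; streaming and batch jobs both call it.
    static Table flagLargeTransfers(Table transactions) {
        return transactions
                .filter($("amount").isGreater(lit(10_000))) // illustrative threshold
                .select($("txn_id"), $("user_id"), $("amount"));
    }

    public static void main(String[] args) {
        // Streaming monitor: evaluates the rule continuously on live events.
        TableEnvironment streaming =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        // Batch backtest: replays the same rule over historical data.
        TableEnvironment batch =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Each environment would register its own 'transactions' source
        // (Kafka for streaming, a lake table for batch), then run:
        // flagLargeTransfers(env.from("transactions")).executeInsert("alerts");
    }
}
```

Because the rule is an ordinary Java method over a `Table`, it can be unit-tested once and deployed to both pipelines, which is precisely the duplication problem the separated architecture could not avoid.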
Conclusion
The rise of real-time computing highlighted the value of fresh data, while batch processing underscored the importance of historical insights. This convergence has driven technology away from two parallel paths toward stream-batch unification — not just as a technical combination, but as an architectural paradigm shift.